fix: prevent munge startup race on compute node boot #5423
mtibben wants to merge 3 commits into GoogleCloudPlatform:develop
On first boot, systemd starts munge.service before the startup script has a chance to deploy /etc/munge/munge.key via setup_network_storage(). Munge fails and enters a failed state, which can leave slurmd unable to authenticate with slurmctld even after setup_compute() restarts munge with the key in place. Install a systemd drop-in (ConditionPathExists=/etc/munge/munge.key) before deploying the key so munge is skipped (inactive) rather than failing at boot. Also call systemctl reset-failed munge to clear any failed state from the race before the restart.
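As a rough illustration of the drop-in described above, here is a minimal sketch that renders and writes such a file. The helper names (`render_dropin`, `write_dropin`) and the file name `10-require-key.conf` are hypothetical and are not taken from the PR's diff; only the `ConditionPathExists=/etc/munge/munge.key` directive and the `munge.service` unit name come from the description.

```python
from pathlib import Path

# Key path named in the PR description.
MUNGE_KEY = Path("/etc/munge/munge.key")


def render_dropin(key_path: Path) -> str:
    """Return drop-in text that makes systemd skip the unit while key_path is missing."""
    # ConditionPathExists= belongs in the [Unit] section of a unit file.
    return f"[Unit]\nConditionPathExists={key_path}\n"


def write_dropin(systemd_dir: Path, key_path: Path = MUNGE_KEY,
                 name: str = "10-require-key.conf") -> Path:
    """Write the drop-in under <systemd_dir>/munge.service.d/ and return its path.

    The file name is a hypothetical choice; systemd only requires a .conf suffix.
    """
    dropin_dir = systemd_dir / "munge.service.d"
    dropin_dir.mkdir(parents=True, exist_ok=True)
    path = dropin_dir / name
    path.write_text(render_dropin(key_path))
    return path
```

On a real node this would presumably be called as `write_dropin(Path("/etc/systemd/system"))` and followed by `systemctl daemon-reload` so systemd picks up the new file; whether the PR does exactly that is not shown here.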
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a race condition occurring during compute node boot where the munge service attempts to start before its required key is deployed. By introducing a conditional check via a systemd drop-in, the service remains inactive rather than entering a failed state, ensuring reliable authentication for slurmd upon subsequent startup.
Code Review
This pull request modifies the setup script to include a systemd override for the munge service, ensuring it only starts once the munge key is present to avoid boot-time race conditions. The review feedback suggests using the available dirs.munge object to reference the munge key path rather than hardcoding it, enhancing code maintainability.
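The reviewer's suggestion could look roughly like the sketch below. The real `dirs` object is not shown in this conversation, so the `SimpleNamespace` stand-in and the assumption that `dirs.munge` points at `/etc/munge` are illustrative only, as are the helper names.

```python
from pathlib import Path
from types import SimpleNamespace

# Stand-in for the setup script's `dirs` object (assumption: dirs.munge is the
# munge configuration directory). The real definition is not shown in this PR.
dirs = SimpleNamespace(munge=Path("/etc/munge"))


def munge_key_path(dirs) -> Path:
    """Derive the key path from dirs.munge instead of hardcoding it."""
    return Path(dirs.munge) / "munge.key"


def condition_line(dirs) -> str:
    """Build the systemd condition from the derived key path."""
    return f"ConditionPathExists={munge_key_path(dirs)}"
```

Deriving the path this way keeps the drop-in and the key deployment pointing at the same location if the munge directory ever changes.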
...unity/modules/scheduler/schedmd-slurm-gcp-v6-controller/modules/slurm_files/scripts/setup.py
/gcbrun
Hmmmm, I've discovered this PR is not a complete fix. The runtime drop-in in … For a complete fix, the drop-in would need to be in the base SLURM GCP image. I've created a PR at GoogleCloudPlatform/slurm-gcp#336.
On compute node first boot, systemd starts `munge.service` (which is enabled in the SLURM image) before the startup script has deployed `/etc/munge/munge.key` via `setup_network_storage()`. Munge fails and enters a `failed` systemd state.

The existing `systemctl restart munge` in `setup_compute()` runs after the key is deployed, but a prior `failed` state can leave the munge socket unreliable, causing `slurmd` to be unable to authenticate with `slurmctld` even though both services appear running. This manifests as nodes stuck in `NOT_RESPONDING`+`POWERING_UP` with jobs hanging in `CONFIGURING`.

This fix installs a systemd drop-in for `munge.service` with `ConditionPathExists=/etc/munge/munge.key` before `setup_network_storage()` runs. This causes systemd to skip (not fail) the munge autostart when the key is absent, leaving it `inactive` rather than `failed`. A `systemctl reset-failed munge` also clears any failed state from the current boot's race before the key-deployment restart.
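The ordering described here can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual code: `start_munge_after_key`, its callback parameters, and the `daemon-reload` step are hypothetical additions, while `systemctl reset-failed munge` and `systemctl restart munge` are the commands named in the description.

```python
import subprocess
from typing import Callable, Sequence


def _systemctl(args: Sequence[str], check: bool = True) -> None:
    """Run `systemctl <args>`; raise on failure only when check is True."""
    subprocess.run(["systemctl", *args], check=check)


def start_munge_after_key(deploy_key: Callable[[], None],
                          install_dropin: Callable[[], None],
                          run: Callable[..., None] = _systemctl) -> None:
    """Install the drop-in, deploy the key, then clear failed state and restart munge."""
    install_dropin()                 # must happen before the key is deployed
    run(["daemon-reload"])           # assumption: reload so systemd sees the drop-in
    deploy_key()                     # stands in for setup_network_storage()
    # Tolerate a non-zero exit here: there may be no failed state to clear.
    run(["reset-failed", "munge"], check=False)
    run(["restart", "munge"])        # the key now exists, so the condition passes
```

Passing `run` as a parameter keeps the sequence testable without a live systemd; in production the default `_systemctl` would be used.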